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Abstract 

In this paper we present a practical solution with performance guarantees to the problem of di¬ 
mensionality reduction for very large scale sparse matrices. We show applications of our approach to 
computing the low rank approximation (reduced SVD) of such matrices. Our solution uses coresets, 
which is a subset of 0(fe/e^) scaled rows from the n x d input matrix, that approximates the sub of 
squared distances from its rows to every fc-dimensional subspace in up to a factor of lie. An 

open theoretical problem has been whether we can compute such a coreset that is independent of the 
input matrix and also a weighted subset of its rows. We answer this question affirmatively. Our main 
technical result is a novel technique for deterministic coreset construction that is based on a reduction 
to the problem of li approximation for item frequencies. 


1 Introduction 

Algorithms for dimensionality reduction usually aim to project an input set of d-dimensional vectors (database 
records) onto a fc < d — 1 dimensional affine subspace that minimizes the sum of squared distances to these 
vectors, under some constraints. Special cases include Principle Component Analysis (PCA), Linear re¬ 
gression {k = d — 1), Low-rank approximation (fc-SVD), Latent Drichlet Analysis (LDA) and Non-negative 
Matrix Factorization (NNMF). Learning algorithms such as A:-means clustering can then be applied on the 
low-dimensional data to obtain fast approximations with provable guarantees. 

However, there are still no practical algorithms with provable guarantees that compute dimension re¬ 
duction for sparse large-scale data. Much of the large scale high-dimensional data sets available today (e.g. 
image streams, adjacency matrices of graphs and social networks, text streams, etc.) are sparse. For ex¬ 
ample, consider the text case of the English Wikipedia. We can associate a matrix with the Wikipedia, 
where the (usually English) words define the columns (approximately 8 millions terms) and the individual 
documents define the rows (approximately 3.7 million documents). This large scale matrix is sparse because 
every document uses a very small fraction of the possible English words. 

In this paper we present an algorithm for dimensionality reduction of very large scale sparse data sets 
such as the Wikipedia with provable approximations. Our approach uses coresets to solve the time and space 
challanges. 

Given a matrix A, a coreset C in this paper is defined as a weighted subset of rows of A such that the 
sum of squared distances from any given fc-dimensional subspace to the rows of A is approximately the same 
as the sum of squared weighted distances to the rows in C. Eormally, 

Definition 1 ((e, fc)-Coreset). Given a n x d matrix A whose rows ai,--- ,an are n points (vectors) in 
an error parameter e G (0,1], and an integer k G [l,d — 1] that represents the desired dimensionality 
reduction, an (e, k)-coreset for A is a weighted subset C = {wiQi | Wi > 0, z S {1, • • • , n}} of the rows of A, 
where w = (u>i, • • • ,Wn) G [0, oo)" is a non-negative weight vector such that for every k-subspace S in we 
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have 


^(dist(aj, S))'^ - ^(dist(wiaj, S'))^ 

i i 

< e y^(dist(ai, S)Y. 

i 

That is, the sum of squared distances from the n points to S approximates the sum of squared weighted 
distances w^(dist(ai, S))'^ to S. The approximation is up to a multiplicative factor of lie. By choosing 
w = (1, • • • , 1) we obtain a trivial (0, fc)-coreset. However, in a more efficient coreset most of the weights 
will be zero and the corresponding rows in A can be discarded. The cardinality of the coreset is thus the 
sparsity of w 

\C\ = Ikllo := I {wi 7^ 0 I i G {l,- - • ,n}} |. 

If C is small, then the computation is efficient. Because C is a weighted subset of the rows of A, if A is 
sparse, then C is also sparse. A long-open research question has been whether we can have such a coreset 
that is both of size independent of the input dimension {n and d) and a subset of the original input rows. 
In this paper we answer this question affirmatively as follows. 

Theorem 1. For every input matrix A G a parameter error e G (0,1] and an integer k > 1, there is 

a {k,e)-coreset of size \C\ = 0{kje^); see Definition p]). 

Our proof is constructive and we provide an efficient algorithm for computing such a coreset C. 

1.1 Why Coresets? 

The motivation for using data reduction technique, and coresets as defined above in particular, is given 
below. 

Fast approximations. The fc-subspace that minimizes the sum of squared distances to the vectors in C, 
will minimize the sum of squared distances to the rows of A, up to a factor of (lie), over every fc-subspace 
in R'^. If the e-coreset C is small, it can used to efficiently compute the low-rank approximation (reduced 
SVD) of the original matrix. More generally, using an e-coreset we can compute the optimal fc-subspace S 
under some given constraints, as desired by several popular dimensionality reduction techniques. 

Streaming and parallel computation. An algorithm for constructing an e-coreset off-line on a single 
machine, can be easily turned into an algorithm that maintains an e-coreset C for a (possibly unbounded) 
stream of d-dimensional input vectors, oi, • • • , a„ where n is the number of vectors seen so far, using one pass 
over these vectors and OdCj logn) memory. Using the same merge-and-reduce technique, we can compute 
C in a way known as embarrassingly parallel on distributed M machines on a cloud, network or GPGPU, 
and reduce the construction time of C by a factor of M. In both the streaming and parallel models, the 
size of the coreset C will be increased where e is replaced by 0{s/ logn) in the memory and running time 
computations. For our specific technique of the Frank-Wolfe algorithm, similar streaming versions do not 
introduce the logn factor. 

Applications to sparse data. Let s denote the sparsity, i.e., maximum non-zero entries, in a row of 
A, over all of its n rows. Note that this definition is different than the more common definition of number of 
non-zeroes entries (nnz) in the entire matrix. The memory (space) that is used by C is then OdCj -m) words 
(real numbers). In particular, if the cardinality of C is independent of the dimensions n and d of A, then so 
is the required memory to store C and to compute its low rank approximation. This property allows us to 
handle massive datasets of unbounded dimensions and rows, with the natural limitation on the sparsity of 
each row. 

Sparse reduced SVD. A lot of algorithms and papers aim to compute a low rank approximation which 
can be spanned by few input points. In particular, if each input point (row) is sparse then this low rank 
approximation will be sparse. For example, in the case of text mining, each topic (singular vector) will be a 
linear combination of few words, and not all the words in the dictionary. We get this property “for free” in 
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the case of coresets: if the coreset consists of few input rows, its optimal fc-subspace (fc-rank approximation) 
will also be spanned by few input points. The optimal fc-subspace of the sketches presented in Section 11.31 
do not have this property in general. 

Interpretation and Factorization. Since coreset consists of few weighted rows, it tells us which records 
in the data are most importance, and what is their importance. For example, a coreset for the Wikipedia 
tells which documents are most important for approximating the reduced SVD (Latent Semantic Analysis). 
This and the previous property are the motivations behind the well known CX and CUR decompositions, 
where C is a subset of columns from the transpose input matrix , R is a subset of rows from . A 
coreset represents an even simpler WA decomposition, where IT is a diagonal sparse matrix that consists of 
the weights wi, • • • ,?u„ in its diagonal, or a DR decomposition where I? is a diagonal matrix that consists 
of the non-zero rows of W, and i? is a subset of rows from A. 

1.2 Our contribution 

Our main result is the first algorithm for computing an (e, fc)-coreset C of size independent of both n and d, 
for any given n x d input matrix . Until today, it was not clear that such a coreset exists. The polynomial 
dependency on d of previous coresets made them useless for fat or square input matrices, such as Wikipedia, 
images in a sparse feature space representation, or adjacency matrix of a graph. The output coreset has all 
the properties that are discussed in Section [Ol We summarize this result in Theorem [TJ 

Open problems. We answer in this paper two recent open problems: coresets of size independent of 
d for low rank approximation [7] and for 1-mean queries [2]. Both of the answers are based on our new 
technical tool for coreset construction; see Section [21 

Efficient construction. We suggest a deterministic algorithm that efficiently compute the above coreset 
for every given matrix. The construction time is dominated by the time it takes to compute the fc-rank 
approximation for the input matrix. Any such (e, fc)-coreset can be computed for both the parallel and 
streaming models as explained in [5], or using existing streaming versions for the Frank-Wolfe algorithm. 

Efficient Implementation. Although our algorithm runs on points in R‘^ that represent dxd matrices, 
but we show how it can actually be implemented using operations in similarly to learning kernels 
techniques. The running time of the algorithm is dominated by the time it takes to compute a constant factor 
approximation to the fc-rank approximation of the input matrix, and the time it takes to approximates the 
farthest input point from a given query point in the input. Over the recent years several efficient algorithms 
were suggested for these problems using pre-processing, in particular for sparse data, (e.g. [3], Fast JL and 
farthest nearest neighbour). 

1.3 Related Work 

Existing Software and implementations. We have tried to run modern reduced SVD implementations 
such as GenSim m that uses random projections or Matlab’s svds function that uses the classic LAPACK 
library that in turn uses a variant of the Power (Lanczoz) method for handling sparse data. All of them 
crashed for an input of few thousand of documents and fc < 100, as expected from the analysis of their 
algorithms. Even for fc = 3, running the implementation of svds in Hadoop [12] (also a version of the Power 
Method) took several days [H] . 

Coresets. Following a decade of research, in m it was recently proved that an (e, fc) coreset of size 
ICj = 0{dk^/£^) exists for every input matrix. The proof is based on a general framework for constructing 
different kinds of coresets, and is known as sensitivity Hi- This coreset is efficient for tall matrices, since 
its cardinality is independent of n. However, it is useless for “fat” or square matrices (such as the Wikipedia 
matrix above), where d is in the order of n, which is the main motivation for our paper. In [2], the Frank- 
Wolfe algorithm was used to construct different types of coresets than ours, and for different problems. Our 
approach is based on a solution that we give to an open problem in [2], however we can see how it can be 
used to compute the coresets in [2] and vice versa. 

Sketches. A sketch in the context of matrices is a set of vectors ui, • • • , Ug in such that the sum of 
squared distances from the input n points to every fc-dimensional subspace S in can 
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be approximated by ‘^))^ '^P ^ multiplicative factor of 1 ± £. 

Note that even if the input vectors ai, • • • , a„ are sparse, the sketched vectors ui, ■ • • , Ms in general are 
not sparse, unlike the case of coresets. A sketch of cardinality d can be constructed with no approximation 
error (e = 0), by defining mi, • • • , m^ to be the d rows of the matrix DV"'^ where UDV"'" = A is the SVD of A. 
It was proved in that taking the first 0{k/e) rows of DV'^ yields such a sketch, i.e., of size independent 
of both n and d. Using the merge-and-reduced technique, this sketch can be computed in the streaming and 
parallel computation models. The final coreset size will be increased by a factor of O(logM) in this case. 
For the streaming case, it was shown in uniE] that such a strong sketch of the same size, 0{k/e), can be 
maintained without the additional O(logn) factor. The first sketch for sparse matrices was suggested in [3], 
but like more recent results, it assumes that the complete matrix fits in memory. 

Lower bounds. Recently, [3 [ID] proved a lower bound of 0{k/e) for the cardinality of a strong sketchs. 
A lower bound of 0{k/e^) for strong coreset was proved for the special case fc = d — 1 in [1]. 

The Lanczoz Algorithm. The Lanczoz method and its variant as the Power Method are based on 
multiplying a large matrix by a vector for a few iterations to get its largest eigenvector vi. Then the 
computation is done recursively after projecting the matrix on the hyperplane that is orthogonal to vi. 
Multiplying a matrix by a vector can be done easily in the streaming mode without having all the matrix in 
memory. The main problem with running the Lanczoz method on large sparse data is that it is efficient only 
for the case k = 1: the largest eigenvector vi of a sparse matrix A is in general not sparse even A is sparse. 
Hence, when we project A on the orthogonal subspace to vi, the resulting matrix is dense for the rest of the 
computations {k > 1). Indeed, our experimental results show that the sparse SVD function in MATLAB 
which uses this method runs faster than the exact SVD, but crashes on large input, even for small k. 

2 Novel Coreset Constructions 

Our main technical result is a new approach that yields coresets of size significantly lower than the state- 
of-the-art. While this paper is focused on coresets for computing the reduced SVD, our approach implies a 
technique for constructing coresets that may improve most of the previous coreset constructions. 

Recall that coresets as defined in this paper must approximate every query for a given set of queries. 
In this paper the queries are A:-subspaces in Unlike sketches, coresets must also be weighted subsets of 
the input. In this sense, all existing coresets constructions that we aware of can be described as computing 
sensitivity (or leverage score, or importance) for each point, as described earlier, and then computing what 
is known as an e-sample; see [3]. For a given set A of size n, and a set of queries (subsets) of A, a subset 
S' C A is an e-sample if for every query, the fraction / of A that is covered by the query is e-approximated 
by the fraction / that it covers from S, i.e, 

1 /-/!<£■ ( 1 ) 

Such e-sample can usually be constructed deterministically in time that is exponential in the pseudo¬ 
dimension d of A (a complexity measure of shapes that is similar to VC-dimension a) , or using uniform 
random sampling of size the is linear d. The second approach yields randomized constructions which is 
known to be sub-optimal compared to deterministic ones [1]. 

In this section we suggest a new gradient descent type of deterministic construction of coresets by reducing 
coreset problems to what we call Item Frequency £ 2 -Approximation. The name comes from the well known 
Item Frequency Approximation (IFA) problem that was recently used in |10j to deterministically compute a 
sketch of size 0{k/e) for the reduced SVD problem. Unlike the e-sample and previous deterministic approach, 
or approach yields deterministic coresets of size independent of the pseudo-dimension d. This property will 
turn into coresets of size independent of d for the reduced SVD problem and also for the simpler 1-mean 
coreset that will be described below. 

To show the relation to o, we present the IFA problem a bit differently than in existing papers. In the 
IFA problem there is a universe of n items / = {ei, • • • , e^} and a stream ai, • • • , a„ G / of item appearances. 
The frequency fi of an item ai stands for the fraction of times appears in the stream (i.e., its number of 
occurrences divided by n). It is trivial to produce all item frequencies using 0{d) space simply by keeping a 
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counter for each item. The goal is to use 0(l/e) space and produce approximate frequencies /i, • • • , fn such 
that 

\f^- fi\ < £■ 

Note the surprising similarity to £-samples in O- 

As stated there [inmi, IFA is the fundamental tool that was used to construct sketches of size 0{k/s). 
However, it was also proved in [7] that this approach cannot be used to construct coresets (weighted subsets). 
This was suggested there as an open problem. In this paper we solve this problem by generalizing the IFA 
problem as follows. 

IFA equals £inf-IFA. First observe that the input vectors for the IFA problem can be considered as the 
rows of the d x d identity matrix /. We then wish to approximate each entry of the resulting mean vector 
by a small weighted subset, 

max|/j - fi\ = 

I 

where w G [0, c»)" is a non-negative sparse vector of weights, and ||(a:i, • • • , Xn)||oo = maxi \xi\ for x G R" is 
known as the norm. More generally we can replace / by all the unit vectors in 

.^ 2 -IFA. In this paper we define the version of IFA as the problem of computing a sparse non-negative 
vector w G [0, oo)" such that 

Qi 


where ||(a;i, • • • ,x „)||2 = known as the ^ 2 , Frobenius, or Euclidean norm. 

Let ZI” denote the union over every vector z G [0,1]" that represent a distribution, i.e., Zi = 1. Our 
first technical result is that for any finite set of unit vectors ai,--- , 0 ^ in R'^, any distribution z G IZ", 
and every e G (0,1], we can compute a distribution w G ZI" of sparsity (non-zeroes entries) ||w||o < 1/e^. 
This result is in fact straight-forward by applying the Frank-Wolfe algorithm [5] , after normalizing the input 
points. 

Theorem 2 (Item Frequency £2 Approximation). Let z G ZI" be a distribution over n vectors oi, • • • , a„ in 
R^^. There is a distribution w G ZI" of sparsity ||w||o < 1/e^ such that 

II ^Zj • Ci - '^Wiai\\2 < e. 

i i 

The Caratheodory Theorem proves Theorem [5] for the special case e = 0 using only d + 1 points. 
Our approach and algorithms for Z' 2 -IFA can thus be considered as ^-approximations for the Caratheodory 
Theorem, to get coresets of size independent of d. Note that a Frank-Wolfe-style algorithm might run more 
than (Z-l-1 or n iterations without getting zero error, since the same point may be selected in several iterations. 
Computing in each iteration the closest point to the origin that is spanned by all the points selected in the 
previous iterations, would guarantee coresets of size at most d -I- 1, and fewer iterations. Of course, the 
computation time of each iteration will be much slower in this case. Due to lack of space we omit further 
details. 

2.1 Warm up: reduction to 1-mean s-coresets. 

Given a set oi, • ■ • , a„ of n points in R"^, and e G (0,1), we define a 1-mean e-coreset to be a weight vector 
w G [0, 00 )" such that the sum of squared distances II®* “ ®IP every center c G R'^ to these input 

points is the same as the weighted sum X]r=i ®’*ll®* of the weighted points, up to a (I±e) multiplicative 
factor. 

Using the sensitivity framework [4], we can compute such a coreset that has size 0{d/£^) with high 
probability [^. A corollary to this also answers the open problem in [2]. Although the approximation 
property should hold for every center c, we reduce the I-mean coreset problem to the £2 IFA problem, whose 


WiGi 


< £, 


Ei®* 




W^Oi 
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solution depends only on the input points. By scaling and normalizing the input points, we prove that 
Theorem!^ can be used to deterministically compute the first coreset of size independent of d for the 1-mean 
problem in time that is not exponential in 1/e. 

Corollary 1 (1-mean e-coreset). For every set of of n vectors in and every e G (0,1) a 1-mean e-coreset 
of size 0(l/e^) can be computed deterministically in 0(nd/e) time. 


3 Reduction from Coreset to ^ 2 -Item Frequency Approximation 


In the previous section we suggested the problem of £ 2 -IFA and several algorithms to solve it. We then 
suggested a new construction for 1-mean coreset that is based on a simple reduction to the .^ 2 -IFA problem 
using a scaled versions of the input points. In this section we prove a more involved reduction from the main 
problem of this paper: computing a (/c, e)-coreset for the reduced SVD problem of a matrix A. In this case 
the input to the .^ 2 -IFA problem is a set oi n d x d matrices, i.e., vectors in Each such matrix has the 

form VivJ where Vi is related to the ith row of the U matrix in the Singular Value Decomposition UDV"^ 
of A. The proof of Theorem [T] then follows by bounding the right hand side of @ using Theorem [5] This 
is done by normalizing the input set, where the mean is translated and scaled to be the vector (1, • • • , 0). 
To get an e-additive error after this scaling, the number of iterations is at most the sum of the norms of the 
matrices VivJ divided by e^, which yields a coreset of size fc/e^. 

Notation. We denote by the set of all n x d matrices. For a given integer i > 1 we denote 

[n] = {I,--- ,n}. For a pair of indexes (integers) i £ [n], j £ [d] and a given matrix X £ R"^'^ we denote 
by Xij its entry in the Ah row and jth column. As in Matlab, we denote by / : k the set of indexes 
{j,j -I- 1, • • • ,k} for an integer k > j. The Ah row of X is denoted by £ R^^'^ and the jth column by 
X:j e R"^^. The Ah entry of a vector x = (xi, ■ ■ ■ , Xd) is denoted by Xi. 

The Frobenius norm (root of squared entries) of a matrix or a vector X is denoted by ||A||i? = ||A|| = 

identity matrix is denoted by / and the matrix whose entries are all zeroes is denoted 
by 0. The size of I and 0 is determined by the context. The inner product of a pair of matrices X,Y £ R"^'^ 
is denoted by A • V ^Yh=i - 

We are now ready to prove the reduction from coresets to the Item Frequency £2 Approximation problem, 
as follows. Theorem [T] then follows by using 

Claim 1. Let A £ R"^'^ he a matrix of rank d, and let = A denote its full SVD. Let k £ [1, d — 1] be 

an integer, and for every i £ [n] let 


Vi = 



.U, 




i, 





( 2 ) 


Let W £ R"^" be a diagonal matrix. Then for every X £ such that X^X = I we have 


||WAA||2 


< 5- 


2=1 


Proof. Let e = || “ ^i,i)vivf \\. For every i £ [n] let ti = 1 — Put X £ k) 

X^X = I. Without loss of generality we assume = J, i.e., A = UT,, otherwise we replace X by V^X. 
It thus suffices to prove that 


\j2u\\A,xr\<54Axf. 


(3) 


2 


6 







Using the triangle inequality, 


\J 2 U\\A^,X\ 


< 


i i 

+ \j2u\\{Ai:k,o)xr\. 


(4) 

(5) 


We now bound the last two expressions. 

Bound on It was proven in [5] that for every pair of fc-subspaces 81,82 in there is u > 0 and 
a (fc — l)-subspace T C 81 such that the distance from every point p G 81 to 82 equals to its distance to T 
multiplied by u. By letting 5'i denote the fc-subspace that is spanned by the first k standard vectors of 
letting 82 denote the fc-subspace that is orthogonal to each column of X, and y € be a unit vector that 
is orthogonal to T, we obtain that for every row vector p 


\\{p,0)Xr = u^{pyr. 


( 6 ) 


After defining x = ^i-.k,i-.ky/\\^i-.k,i-.ky\\, ® is bounded by 


^ U\\iA,,,k,0)Xf = 

i i 

i 

— ^ ^ ^2 II ^2,l:fc^l:fc,l:fcy II 

i 

= (7) 

i 

The left side of ® is bounded by substituting p = Syiife in ® for j G [k], as 

k k 

U^\\^l:k,l:kyf = ^ ||(E,-i,fe,0)X||2 

1=1 i=i 

i=i i=i 

= llEXf = ||C7EXf = IIAXf. (8) 

The right hand side of 0 is bounded by 

\^ti\\iUi,l,k)x\\‘^\ = I ti(Ui^i:k)'^Ui^i:k • xx'^l 

i i 

— \xx * ^ ^ ^2 {JJi,l:k') ^2,l:fc| 

i 

^ * \\^U{Uij:k)'^Ui^i:k\\ ( 9 ) 

i 

^ \\^UiVi^l-.k)'^Vi,l:k\\ < W^UvfviW = S, ( 10 ) 

i i 

where m is by the Cauchy-Schwartz inequality and the fact that ||a;a:^|| = ||ai|p = 1, and in (nni) we used 
the assumption Aij = Ui^aj = Vij for every j G [A]. 
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Plugging ([5]) and (ITUl) in ([7]) bounds ([5]) as 


\Y,UKA^.v.k, 0 )xr\<e\\AXr. (11) 

i 

Bound on (|4]): For every i € [n] we have 

\\A,xf-\\{A,,..k,o)xf 

= 0 )XX^( 0 , Ai^k+l-.d)'^ 

+ II ( 0 , Ai^k+i-.d^XW"^ 

= 2Ai^i.kXi:k,-.{Xk+l-.d,-.)'^{Ai^k+l-.d)'^ 

+ ||(0,^i,fe+l:<i)-^|p 

( 12 ) 


= 2 ^ AijXj^-XXk+i-.d,-.)'^ (Ai^k+i-.d)'^ 
i=i 

+ 11(0, Ai_fc+i:c()X|p 

k 

i=i 

+ ||o'fe+l:d|P||(0,'yi,fc+i:d)-’^|P- 

Summing this over i G [n] with multiplicative weight ti and using the triangle inequality, will bound (|^ 

by 


Y,U\\A,,Xf 

i i 

k 

i j=l 

■ ll^fc+litzIl'Cjj (■Ui^fc-|-i:rf) 


y^ti|kfc+l:d|P||(0,^i,fc+l:d)^||' 


The right hand side of (fT^ is bounded by 


^ ^ : (Jffc-|-i:e/) * \\o'k+l:d\\^^ti'^i,j(j^i,k+l:d^ 

i=l i 

k 

— 2f7j 11^^-^. II ■ ||fTfc+l:(j|| II ^ II 

3^1 i 

< ^{sa]\\X,,r + ll^^+^-^ll' ||^t.u.,,-u.,fe+i,rff) 

1 ^ 
j=i I 

<2e\\AXr, 


(13) 


(14) 


(15) 

(16) 


(17) 










where (Ha is by the Cauchy-Schwartz inequality, dT51) is by the inequality 2 ab < + b"^. In (ITTl) we used 

the fact that Vi^k+i-.d is a block in the matrix tiVivf, and 

k 

\\ak+i-.df <\\AXf and 

( 18 ) 

= \\^l:k,l:kXi..k,f < IlSXf < IIAXf. 

Next, we bound (fHl) . Let Y G such that = / and = 0. Hence, the columns of F span the 
fc-subspace that is orthogonal to each of the {d — k) columns of X. By using the Pythagorean Theorem and 
then the triangle inequality, 


lkfe+l:d|p| ^ij||(0,'!;i,fc+i:d)X|p| (19) 

i 

= lkfe-|-l:(i|| I k i II (0; Vi,k+l-.d)\\ 

i 

-Y,u\mv.,k+i:d)Yr\ 

i 

^i||'^'i,fc+l:c/|| I (20) 

i 

+ \\ak+i-.df\Y.U\\{0,v^,k+i-.d)Yf\. ( 21 ) 

i 

For bounding m, observe that F corresponds to a (d — k) subspace, and {0,Vi^k+i-.d) is contained in the 
{d — k) subspace that is spanned by the last (d — k) standard vectors. Using same observations as above (O, 
there is a unit vector y G such that for every i G [n] ||(0,Ui^fe+i:d)F|p = \\{vi^k+i-.d)y\\‘^■ Summing this 

over ti yields. 


\Y^U\\{0,Vi,k+i-.d)Y\\^\ = \ Y^U\\v^,k+i-.dyf\ 

i i 

i j—k-\-l j—k-\-l i 

Replacing in (HI]) by the last inequality yields 


kfc-l-l:d n l(0,r’i,fe+l:d)X P 

i 


d 

< Ikfc+ndf+ 51 y]-k\\^UvivJ\\) 

(22) 

i j—k-\-l i 


d 

< lkfc+i:df (e + e 5] d|-fc) < 2emxf, 

(23) 


j=k+l 


where ((^ follows since ^ ■ tivf j is an entry in the matrix ^ ■ tiVivf, in (1^ we used (fT51) and the fact that 
IklP = 1. Plugging (fTTll in (IT^ and (1^ in(I5|) gives the desired bound on (|1|) as 

i i 

Finally, using CD in (d and the last inequality in (|3|), proves the desired bound of (0. □ 
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Algorithm 1 SVD-Coreset(A, fc, e) 

Input: A: n input points in 

Input: k: the approximation rank 
Input: e: the nominal approximation error 
Output: w G [0,oo)": non-negative weights vector. 
Compute UDV^ = A, the SVD of A 
R G- Dk+l:d,k+l:d 

P G- matrix with i G [n] row is Ui^k+i-.d ■ p§ 7 ) 

X G- matrix with i G [n] row is Xi = Pi/||Pi|l 

w ^ (1,0, ••• ,0) 

for i = 1 ,..., k/e^ do 
j •<— argmiuj {wXXi} 

6 = (1 - ||PA,||| + Er=i w.WPX^Wj.) /IIPIII 
c=\\wxrp 

q; = (1 — a-|-5)/(l-|-c — 2a) 
w <— (1 — a)Ij^. + aw 

end for 

return w _ 


3.1 Implementation. 

We now give a brief overview that bridges the gap between the theoretical results and the practical imple¬ 
mentation presented in Algorithm 1. The coreset construction described in this section use storage of 0{(f') 
per point. However the suggested implementation uses 0{d) space for point. To achieve this we leverage 
several observations: (1) we do not need to compute the center, only its norm, which can we can update 
recursively from a starting value of 1; (2) the term Wi{xixJ)‘^ only needs to be computed 0 {k/e‘^) times, 
and its value stored for further iterations that find the same farthest point. 

The algorithm works as follows. First we restructure the input points A using their k-rank SVD decom¬ 
position UDV^ into a new input matrix P, and normalize it to get X. We arbitrarily select starting point 
Xj = Xi and set its weight to Wj = 1, with all other weights set to zero. Next, we compute the farthest 
point from Xj by projecting the weighted points onto the current point. The following expressions a,b,c 
contain all the update steps in order to compute the norm of the new center recursively without computing 
the actual center itself (an 0{(P) computation). Finally, the value of a is updated based on the ratio of 
distances from the current point, current center, and new center, and the weights are updated based on the 
new a. The algorithm runs for k/e'^ iterations, or until a has converged to 1. The output of the algorithm 
is a sparse vector of weights satisfying the guarantees. 

4 Conclusion 

We present a new approach for dimensionality reduction using coresets. The key feature of our algorithm 
is that it computes coresets that are small in size and subsets of the original data. Using synthetic data as 
ground truth we show that our algorithm provides a good approximation. 
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